100 research outputs found
Exploiting parallelism in many-core architectures: Lattice Boltzmann models as a test case
In this paper we address the problem of identifying and exploiting techniques that optimize the performance of large-scale scientific codes on many-core processors. We consider as a test-bed a state-of-the-art Lattice Boltzmann (LB) model that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equations of state of a perfect gas. The regular structure of Lattice Boltzmann algorithms makes it relatively easy to identify a large degree of available parallelism; the challenge is mapping this parallelism onto processors whose architecture is becoming more and more complex, both in terms of an increasing number of independent cores and, within each core, of vector instructions on longer and longer data words. We take as an example the Intel Sandy Bridge micro-architecture, which supports AVX instructions operating on 256-bit vectors; we address the problem of efficiently implementing the key computational kernels of LB codes, streaming and collision, on this family of processors; we introduce several successive optimization steps and quantitatively assess the impact of each of them on performance. Our final result is a production-ready code already in use for large-scale simulations of the Rayleigh-Taylor instability. We analyze both raw performance and scaling figures, and compare with GPU-based implementations of similar codes.
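The optimization steps themselves are not spelled out in the abstract; a standard first step when vectorizing LB kernels is moving from an array-of-structures (AoS) to a structure-of-arrays (SoA) data layout, so that the same population index of consecutive lattice sites is contiguous in memory and maps cleanly onto AVX lanes. A minimal NumPy sketch (the 37-population count comes from the D2Q37 model discussed in these abstracts; the relaxation function is purely illustrative, not the paper's collision kernel):

```python
import numpy as np

NPOP, LX, LY = 37, 64, 64  # D2Q37: 37 populations on an LX x LY lattice

# Array-of-structures: all 37 populations of one site are adjacent, so
# site-parallel SIMD lanes would read strided memory.
aos = np.zeros((LX * LY, NPOP))

# Structure-of-arrays: population i of consecutive sites is contiguous,
# so a vector unit can load/store stride-1 runs of sites.
soa = np.zeros((NPOP, LX * LY))

def relax_towards_mean(f, omega=1.0):
    """Collision-like update over the SoA layout (illustrative only).

    Each population row is touched as one contiguous stride-1 sweep --
    the access pattern a 256-bit AVX unit vectorizes well.
    """
    feq = f.mean(axis=0)          # stand-in for the local equilibrium
    return f + omega * (feq - f)  # BGK-style relaxation towards it

soa = relax_towards_mean(soa)
```

With the SoA layout each population row is a stride-1 array, which is exactly what a 256-bit AVX register (four doubles per instruction) can load and store without gather/scatter overhead.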
The three dimensional Ising spin glass in an external magnetic field: the role of the silent majority
We perform equilibrium parallel-tempering simulations of the 3D Ising
Edwards-Anderson spin glass in a field. A traditional analysis shows no signs
of a phase transition. Yet, we encounter dramatic fluctuations in the behaviour
of the model: Averages over all the data only describe the behaviour of a small
fraction of it. Therefore we develop a new approach to study the equilibrium
behaviour of the system, by classifying the measurements as a function of a
conditioning variate. We propose a finite-size scaling analysis based on the
probability distribution function of the conditioning variate, which may
accelerate the convergence to the thermodynamic limit. In this way, we find a
non-trivial spectrum of behaviours, where a part of the measurements behaves as
the average, while the majority of them shows signs of scale invariance. As a
result, we can estimate the temperature interval where the phase transition in
a field ought to lie, if it exists. Although this would-be critical regime is
unreachable with present resources, the numerical challenge is finally well
posed.
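The conditioning-variate analysis described above can be pictured as binning the measurements by quantiles of the conditioning variate and averaging the observable within each bin, rather than over the whole data set. A toy sketch on synthetic data (the variate, the observable and the bin count are placeholders, not the physical quantities used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "measurements": an observable q paired with a conditioning variate c.
# Both are synthetic stand-ins for the spin-glass quantities.
c = rng.normal(size=10_000)
q = c**2 + 0.1 * rng.normal(size=10_000)

def conditional_averages(q, c, nbins=10):
    """Average q inside quantile bins of the conditioning variate c."""
    edges = np.quantile(c, np.linspace(0.0, 1.0, nbins + 1))
    idx = np.clip(np.searchsorted(edges, c, side="right") - 1, 0, nbins - 1)
    return np.array([q[idx == b].mean() for b in range(nbins)])

per_bin = conditional_averages(q, c)
plain = q.mean()  # the plain average mixes all bins together
```

The point of the exercise: the plain average can be dominated by a small fraction of the measurements, while the per-bin averages expose the spectrum of behaviours the abstract refers to.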
Critical parameters of the three-dimensional Ising spin glass
We report a high-precision finite-size scaling study of the critical behavior
of the three-dimensional Ising Edwards-Anderson model (the Ising spin glass).
We have thermalized lattices up to L=40 using the Janus dedicated computer. Our
analysis takes into account leading-order corrections to scaling. We obtain Tc
= 1.1019(29) for the critical temperature, \nu = 2.562(42) for the thermal
exponent, \eta = -0.3900(36) for the anomalous dimension and \omega = 1.12(10)
for the exponent of the leading corrections to scaling. Standard (hyper)scaling
relations yield \alpha = -5.69(13), \beta = 0.782(10) and \gamma = 6.13(11). We
also compute several universal quantities at Tc.
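The derived exponents can be cross-checked against the standard scaling and hyperscaling relations in d = 3 dimensions: \alpha = 2 - d\nu, \beta = \nu(d - 2 + \eta)/2 and \gamma = \nu(2 - \eta). A quick numerical check using the central values quoted above (error bars ignored):

```python
d = 3                      # spatial dimension
nu, eta = 2.562, -0.3900   # central values quoted in the abstract

alpha = 2 - d * nu             # hyperscaling relation
beta = nu * (d - 2 + eta) / 2  # scaling relation for beta
gamma = nu * (2 - eta)         # scaling relation for gamma

# The three derived values agree with the quoted -5.69(13), 0.782(10)
# and 6.13(11) within the stated errors, and the Rushbrooke identity
# alpha + 2*beta + gamma = 2 holds exactly by construction.
rushbrooke = alpha + 2 * beta + gamma
```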
Janus II: a new generation application-driven computer for spin-system simulations
This paper describes the architecture, the development and the implementation
of Janus II, a new generation application-driven number cruncher optimized for
Monte Carlo simulations of spin systems (mainly spin glasses). This domain of
computational physics is a recognized grand challenge of high-performance
computing: the resources necessary to study in detail theoretical models that
can make contact with experimental data are by far beyond those available using
commodity computer systems. On the other hand, several specific features of the
associated algorithms suggest that unconventional computer architectures, which
can be implemented with available electronics technologies, may lead to order
of magnitude increases in performance, reducing to acceptable values on human
scales the time needed to carry out simulation campaigns that would take
centuries on commercially available machines. Janus II is one such machine,
recently developed and commissioned, that builds upon and improves on the
successful JANUS machine, which has been used for physics since 2008 and is
still in operation today. This paper describes in detail the motivations behind
the project, the computational requirements, the architecture and the
implementation of this new machine and compares its expected performances with
those of currently available commercial systems.
Prospects for K+ → π+ ν ν̄ at CERN in NA62
The NA62 experiment will begin taking data in 2015. Its primary purpose is a
10% measurement of the branching ratio of the ultrarare kaon decay K+ → π+ ν ν̄, using the decay in flight of kaons in an unseparated
beam with momentum 75 GeV/c. The detector and analysis technique are described
here.
Externalities and the nucleolus
In most economic applications, externalities prevail: the worth of a coalition depends on how the other players are organized. We show that there is a unique natural way of extending the nucleolus from (coalitional) games without externalities to games with externalities. This is in contrast to the Shapley value and the core, for which many different extensions have been proposed.
Early Experience on Porting and Running a Lattice Boltzmann Code on the Xeon-Phi Co-Processor
In this paper we report on our early experience porting, optimizing and benchmarking a Lattice Boltzmann (LB) code on the Xeon-Phi co-processor, the first generally available version of the new Many Integrated Core (MIC) architecture developed by Intel. We consider as a test-bed a state-of-the-art LB model that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equations of state of a perfect gas. The regular structure of LB algorithms makes it relatively easy to identify a large degree of available parallelism. However, mapping a large fraction of this parallelism onto this new class of processors is not straightforward. The D2Q37 LB algorithm considered in this paper is an appropriate test-bed for this architecture, since its critical computing kernels require high performance both in terms of memory bandwidth for sparse memory-access patterns and of number-crunching capability. We describe our implementation of the code, which builds on previous experience with other (simpler) many-core processors and GPUs, present benchmark results and measured performance, and finally compare with the results obtained by previous implementations developed for state-of-the-art classic multi-core CPUs and GP-GPUs.
Benchmarking GPUs with a parallel Lattice-Boltzmann code
Accelerators are an increasingly common option to boost the performance of codes that require extensive number crunching. In this paper we report on our experience with NVIDIA accelerators to study fluid systems using the Lattice Boltzmann (LB) method. The regular structure of LB algorithms makes them suitable for processor architectures with a large degree of parallelism, such as recent multi- and many-core processors and GPUs; however, the challenge of exploiting a large fraction of the theoretically available performance of this new class of processors is not easily met. We consider a state-of-the-art two-dimensional LB model based on 37 populations (a D2Q37 model) that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas.
The computational features of this model make it a significant benchmark for analyzing the performance of new computational platforms, since the critical kernels in this code require both high memory bandwidth on sparse memory-addressing patterns and high floating-point throughput.
In this paper we consider two recent classes of GPU boards, based on the Fermi and Kepler architectures; we describe in detail all the steps taken to implement and optimize our LB code, and analyze its performance first on single-GPU systems and then on parallel multi-GPU systems, based on a single node as well as on a cluster of many nodes; in the latter case we use CUDA-aware MPI as an abstraction layer to assess the advantages of advanced GPU-to-GPU communication technologies such as GPUDirect.
With our implementation, the aggregate sustained performance of the most compute-intensive part of the code breaks the double-precision Tflops barrier on a single-host system with two GPUs.
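The bandwidth-critical streaming ("propagate") step referred to across these abstracts amounts to rigidly shifting each population along its lattice velocity, with every value read once and written once at a shifted address, which is why the kernel is memory-bound rather than compute-bound. A NumPy sketch over a generic velocity set (the vectors below are illustrative; the actual D2Q37 stencil has 37 velocities reaching up to three lattice spacings away):

```python
import numpy as np

LX, LY = 32, 32
# Small illustrative velocity set (not the real D2Q37 stencil).
velocities = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (3, 0)]

f = np.random.default_rng(1).random((len(velocities), LX, LY))

def propagate(f, velocities):
    """Pure streaming with periodic boundaries.

    Population i moves rigidly by its lattice velocity (cx, cy):
    no arithmetic is performed, only shifted copies, so the step is
    bound by memory bandwidth on the access pattern.
    """
    return np.stack([np.roll(f[i], shift=(cx, cy), axis=(0, 1))
                     for i, (cx, cy) in enumerate(velocities)])

g = propagate(f, velocities)
```

On a real many-core or GPU implementation this copy pattern is what stresses sparse memory addressing: each population demands a different shifted traversal of the lattice.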